Evaluating SEE - A Benchmarking System for Document Page Segmentation
Abstract
The decomposition of a document into segments such as text regions and graphics is a significant part of the document analysis process. Systematic evaluation is the basic requirement for rating and improving page segmentation algorithms. The approaches known from the literature have the disadvantage that they need manually generated reference data (zoning ground truth) for the evaluation task, and creating these data takes considerable effort and cost. This paper describes the evaluation system SEE and presents an assessment of its quality. The system requires as input the OCR-generated text and the original text of the document in correct reading order (text ground truth). No manually generated zoning ground truth is needed. Instead, the implicit structure information contained in the text ground truth is used to evaluate the automatic zoning. To this end, the system seeks an assignment (matches) between corresponding text regions in the text ground truth and in the OCR-generated text. The method is based on a fault-tolerant string matching algorithm and can therefore tolerate OCR errors in the text. The segmentation errors are determined by evaluating the matching. Subsequently, the edit operations necessary to correct the recognized segmentation errors are computed in order to estimate the correction costs. Furthermore, SEE provides a version of the OCR-generated text in which the detected page segmentation errors have been corrected.
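The matching idea in the abstract can be illustrated with a minimal sketch. This is not SEE's actual algorithm, which the abstract does not specify. It assumes a standard approximate substring search (edit distance with a free starting position) as a stand-in for the fault-tolerant string matching, and it flags only one kind of segmentation error, namely regions that appear out of reading order. The function names `best_approx_match`, `locate_regions`, and `out_of_order`, and the `max_error_rate` threshold, are hypothetical.

```python
from typing import List, Optional, Tuple

Match = Tuple[int, int, int]  # (start, end, edit distance)


def best_approx_match(pattern: str, text: str) -> Match:
    """Find the substring of `text` with the smallest edit distance to `pattern`.

    Dynamic programming over edit distance where a match may start at any
    position in `text` at no cost (assumed stand-in for SEE's matcher).
    """
    m = len(pattern)
    prev = list(range(m + 1))      # distances against the empty text prefix
    prev_start = [0] * (m + 1)     # where each partial alignment starts
    best: Match = (0, 0, m)        # fallback: empty substring
    for j, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)        # row 0 stays 0: a match may start anywhere
        cur_start = [j] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i], cur_start[i] = min(
                (prev[i - 1] + cost, prev_start[i - 1]),  # match or substitution
                (prev[i] + 1, prev_start[i]),             # extra character in text
                (cur[i - 1] + 1, cur_start[i - 1]),       # pattern character missing
            )
        if cur[m] < best[2]:
            best = (cur_start[m], j, cur[m])
        prev, prev_start = cur, cur_start
    return best


def locate_regions(gt_regions: List[str], ocr_text: str,
                   max_error_rate: float = 0.3) -> List[Optional[Match]]:
    """Locate each ground-truth region in the OCR text, tolerating OCR errors."""
    matches: List[Optional[Match]] = []
    for region in gt_regions:
        start, end, dist = best_approx_match(region, ocr_text)
        # Treat the region as unmatched if too many edits were needed.
        matches.append((start, end, dist) if dist <= max_error_rate * len(region) else None)
    return matches


def out_of_order(matches: List[Optional[Match]]) -> List[int]:
    """Indices of located regions that start before an earlier located region."""
    flagged, last_start = [], -1
    for idx, match in enumerate(matches):
        if match is None:
            continue
        if match[0] < last_start:
            flagged.append(idx)
        else:
            last_start = match[0]
    return flagged


# Text ground truth in reading order, and OCR output where zoning swapped two regions.
gt = ["The quick brown fox", "jumps over the lazy dog", "and runs away"]
ocr = "jumps ovcr the 1azy dog The quick brovvn fox and runs away"
flagged = out_of_order(locate_regions(gt, ocr))
```

Here `flagged` lists the ground-truth regions whose position in the OCR text contradicts the reading order. A fuller evaluation in the spirit of the abstract would also need to handle split and merged regions and to derive the edit operations that repair them.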
Similar resources
Persian Printed Document Analysis and Page Segmentation
This paper presents a hybrid method, combining low-resolution and high-resolution analysis, for Persian page segmentation. In the low-resolution page segmentation, a pyramidal image structure is constructed for multiscale analysis and the document image is segmented into a set of regions. In the high-resolution page segmentation, each region is segmented into homogeneous regions by connected components analysis and identifyi...
A Region-based System for the Automatic Evaluation of Page Segmentation Algorithms
A method for automatically evaluating the quality of document page segmentation algorithms is described. Page segmentation involves decomposing a page into its structural and logical units such as paragraphs, halftones, captions and tables. These units are then ordered and logically associated. These two steps are very important in a document recognition strategy. Many different techniques have ...
Benchmarking page segmentation algorithms
A method for automatically evaluating the quality of document page segmentation algorithms is introduced. Many different zoning techniques are now available, but there exists no robust method to benchmark and evaluate them reliably. Our proposed strategy is a region-based approach, in which segmentation results are compared with manually generated "ground truth files", describing all possible c...
Ground-truthing and benchmarking document page segmentation
We describe a new approach for evaluating page segmentation algorithms. Unlike techniques that rely on OCR output, our method is region-based: the segmentation output, described as a set of regions together with their types, output order etc., is matched against the pre-stored set of ground-truth regions. Misclassifications, splitting, and merging of regions are among the errors that are detect...
On Benchmarking of Invoice Analysis Systems
An approach is presented to guide the benchmarking of invoice analysis systems, a specific, applied subclass of document analysis systems. The state of the art of benchmarking of document analysis systems is presented, based on the processing levels: Document Page Segmentation, Text Recognition, Document Classification, and Information Extraction. The restriction to invoices enables and require...